Lets do a short introduction to what we’re working with here.

Target Variable

#original Siebert Data
aei <- read.csv2("/Volumes/RachelExternal/Thesis/Data/SiebertData/HID_v10/Country_tenyear_ha.csv")
aei <- as.data.frame(aei)

#rename the cols so that I can use them as number
colnames(aei) <- c("country", "ISO", "ID", "1900", "1910","1920","1930","1940","1950","1960","1970","1980","1985","1990","1995","2000","2005")

#tidying data to long format
aei <- pivot_longer(aei,cols = 4:17, names_to = "year", values_to = "aei_ha")
aei$aei_ha <- removePunctuation(aei$aei_ha)
aei$year <- as.numeric(aei$year)
aei$aei_ha <- as.numeric(aei$aei_ha)
aei$ISO <- as.factor(aei$ISO)
aei$country <- as.factor(aei$country)
#subset the data so that its from the year 1960
aei <- aei %>% subset(year >= 1960)

#make a new col for year count
aei$yearcount <- (aei$year - 1960)

#remove data for the total global irrigated area
globeaei <- subset(aei, country == "WORLD")
aei= aei[!aei$country == "WORLD",]

str(aei)
tibble [1,848 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 232 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ ISO      : Factor w/ 232 levels "","ABW","AFG",..: 3 3 3 3 3 3 3 3 6 6 ...
 $ ID       : int [1:1848] 2 2 2 2 2 2 2 2 6 6 ...
 $ year     : num [1:1848] 1960 1970 1980 1985 1990 ...
 $ aei_ha   : num [1:1848] 2404990 2408168 2518178 2594394 2903569 ...
 $ yearcount: num [1:1848] 0 10 20 25 30 35 40 45 0 10 ...

Added to this to be able to calculate the % of land irrigated per country we need the country area. This data is taken from Worldbank.

#import the data set with total country area, this comes from the Worldbank
countryarea <- read.csv2("/Volumes/RachelExternal/Thesis/Data/LandArea/LandArea.csv")

#removing extra cols
countryarea <- countryarea[-c(265:270), ]
countryarea <- countryarea[, -c(1, 3:4)]

#fix the col names (in the two days since i found this little for loop I learned how to do this faster!)
for ( col in 2:ncol(countryarea)){
    colnames(countryarea)[col] <-  sub("X", "", colnames(countryarea)[col])
}
remove(col)

#making long data
countryarea <- 
  countryarea %>%  
  pivot_longer(!Country.Code, names_to = "year", values_to = "area_km") %>% 
  rename(ISO = Country.Code)

#fixing variable types
countryarea$ISO <- as.factor(countryarea$ISO)
countryarea$year <- as.numeric(countryarea$year)
str(countryarea)
tibble [16,104 × 3] (S3: tbl_df/tbl/data.frame)
 $ ISO    : Factor w/ 264 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year   : num [1:16104] 1960 1961 1962 1963 1964 ...
 $ area_km: num [1:16104] NA 180 180 180 180 180 180 180 180 180 ...

For some reason, the data from Worldbank on each countries area only starts in 1961. For this analysis, 1960 is needed. Country area from 1961 will be substituted for 1960, with the assumption that country area doesn’t change drastically in one year (it doesn’t).

head(subset(countryarea, c(year == 1960 | year == 1961))) 

#fixing 1960
countryarea <-
  countryarea %>% 
  group_by(ISO) %>% 
  fill(area_km, .direction = "up") %>% 
  ungroup()

Merge the country area with the AEI df here.

#combine these data sets
aei <- aei %>% merge(y = countryarea, by.x = c("ISO", "year"), all.x = TRUE)

rm(countryarea)
head(aei)

Some countries also have no area value. I’ll go through and see if any of the countries with missing area values have non zero AEI_ha. If so, I’ll replace them, as they didn’t come through with the WorldBank data set but are online. Copying and pasting from various sources should be fine.

#checking for countries with NA values in their area_km col
#looking primarily for countries that have non 0 AEI_ha
aei[is.na(aei$area_km),]
                
#fixing counties with no area
aei$area_km[which(aei$ISO == "GUF")] <- 83534 #French Guiana from Britannica
aei$area_km[which(aei$ISO == "GLP")] <- 1630 #Guadeloupe from Britannica 
aei$area_km[which(aei$ISO == "MTQ")] <- 1128 # Martinique from Britannica
aei$area_km[which(aei$ISO == "REU")] <- 2512 #Reunion from Britannica
aei$area_km[which(aei$ISO == "SMK")] <- 13812 + 10905 + 77589 #Montenegro + Kosovo + Serbia from Britannica
aei$area_km[which(aei$ISO == "TWN")] <- 35980 #Taiwan from CIA.gov
aei$area_km[which(aei$ISO == "SDN")] <- 1882000 #Sudan from UNDP

Now, transform the AEI_ha to \(km^2\) and divide to get the % of total country area that is equipped for irrigation. There are still some countries that have 0 AEI_ha and no area_km value. I think I will assign the irrfrac/perc 0 to those countries, but lets wait a little while to do that, most are islands.

aei <- aei %>% mutate(irrperc = ((aei_ha/100)/area_km)*100, irrfrac = ((aei_ha/100)/area_km))

Here is a quick histogram of the target variable. A lot of 0s.

hist(aei$irrperc, breaks = 200)


Chosen Predictor Variables

According to the FAO (Walker 1989), the choice of which irrigation, and by proxy whether to irrigate or not, comes from six factors. The predictor that will be used for this analysis are listed below:

All variables will be scaled or centered.

First, the ISO codes and Regions to be able to hold some structure.

#the ISO codes and regional data from Gapminder
iso <- read.csv2("/Volumes/RachelExternal/Thesis/Thesis/Data/GAPminder_ISOcodes.csv")
iso <- iso[, -c(2,7,10,12,13)] #removing extra info

#renaming some cols
iso <- 
  iso %>% 
  rename(ISO = GEO, country = name)

#fixing the variable types
iso$ISO <- as.factor(iso$ISO)
iso$country <- as.factor(iso$country)
iso$four_regions <- as.factor(iso$four_regions)
iso$eight_regions <- as.factor(iso$eight_regions)
iso$six_regions <- as.factor(iso$six_regions)
iso$six_regions <- as.factor(iso$six_regions)
iso$World.bank.region <- as.factor(iso$World.bank.region)
str(iso)
'data.frame':   197 obs. of  8 variables:
 $ ISO              : Factor w/ 197 levels "AFG","AGO","ALB",..: 1 3 50 4 2 8 6 7 9 10 ...
 $ country          : Factor w/ 197 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ four_regions     : Factor w/ 4 levels "africa","americas",..: 3 4 1 4 1 2 2 4 3 4 ...
 $ eight_regions    : Factor w/ 8 levels "africa_north",..: 5 7 1 8 2 3 4 7 6 8 ...
 $ six_regions      : Factor w/ 6 levels "america","east_asia_pacific",..: 5 3 4 3 6 1 1 3 2 3 ...
 $ Latitude         : num  33 41 28 42.5 -12.5 ...
 $ Longitude        : num  66 20 3 1.52 18.5 ...
 $ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: 7 3 5 3 8 4 4 3 2 3 ...
#past me helping future me
#fixing Macedonia, FYR to North Macedonia
iso$country <- as.character(iso$country)
iso[iso == "Macedonia, FYR"] <- "North Macedonia"
iso$country <- as.factor(iso$country)

Cool, we have multiple regions here that we could use in the future for some multilevel business. Also, latitude and longitude is here, which we can also try later. Obviously these lat and lon measurements will mean nothing for a country like China or the USA, but they are nice to have.

Income

Income data is coming from Gapminder, as they have put real effort in to filling all the gaps in income and population for the last 150 years. Income is in terms of the 2011 USD and is inflation adjusted.

#income here is 2011 usd 
income <- read.csv("/Volumes/RachelExternal/Thesis/Thesis/Data/Gapminder_income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
income <- income[, c(1, 162:207)] #removing extra years

#making long data
income <- 
  income %>%  
  pivot_longer(!country, names_to = "year", values_to = "income") 

#still have the xs there
income$year <- str_remove_all(income$year, "[X]")
#transformation of the variable types
income$year <- as.numeric(income$year)
income$country <- as.factor(income$country)

#log and standardize
income$log_income <- log(income$income)
income <- income %>% 
  mutate(log_income_std = log_income / mean(log_income))

#a summary
summary(income)
                country          year          income      
 Afghanistan        :  46   Min.   :1960   Min.   :   312  
 Albania            :  46   1st Qu.:1971   1st Qu.:  2250  
 Algeria            :  46   Median :1982   Median :  5290  
 Andorra            :  46   Mean   :1982   Mean   : 11344  
 Angola             :  46   3rd Qu.:1994   3rd Qu.: 13575  
 Antigua and Barbuda:  46   Max.   :2005   Max.   :179000  
 (Other)            :8602                                  
   log_income     log_income_std  
 Min.   : 5.743   Min.   :0.6651  
 1st Qu.: 7.719   1st Qu.:0.8939  
 Median : 8.574   Median :0.9929  
 Mean   : 8.635   Mean   :1.0000  
 3rd Qu.: 9.516   3rd Qu.:1.1020  
 Max.   :12.095   Max.   :1.4007  
                                  
#past me helping future me again
#fixing Eswatini to Swaziland (even though Eswatini is the correct name)
income$country <- as.character(income$country)
income[income == "Eswatini"] <- "Swaziland"
income$country <- as.factor(income$country)

Logging and standardizing is necessary here. Again, the argument is that wealth begets wealth, so income is logarithmic. From there just center the mean at 0 with a stdev of 1 for standardization.

Population

Population data is also taken from Gapminder. The same process of logging and standardizing will be carried out here.

#population data also taken from Gapminder
population <- read.csv("/Volumes/RachelExternal/Thesis/Thesis/Data/Gapminder_population_total.csv")

#removing extra years
population <- population[, c(1, 162:207)] 
#making longer data
population <- 
  population %>%  
  pivot_longer(!country, names_to = "year", values_to = "population") 

#removing the Xs from the year strings
population$year <- str_remove_all(population$year, "[X]")

#fixing variable types
population$country <- as.factor(population$country)
population$year <- as.numeric(population$year)

#log and standardize, same argument as income: population begets population
population$log_pop <- log(population$population)
population <- population %>% 
  mutate(log_pop_std = log_pop / mean(log_pop))

summary(population)
                country          year     
 Afghanistan        :  46   Min.   :1960  
 Albania            :  46   1st Qu.:1971  
 Algeria            :  46   Median :1982  
 Andorra            :  46   Mean   :1982  
 Angola             :  46   3rd Qu.:1994  
 Antigua and Barbuda:  46   Max.   :2005  
 (Other)            :8694                 
   population           log_pop        log_pop_std    
 Min.   :6.450e+02   Min.   : 6.469   Min.   :0.4316  
 1st Qu.:9.045e+05   1st Qu.:13.715   1st Qu.:0.9151  
 Median :4.545e+06   Median :15.330   Median :1.0228  
 Mean   :2.404e+07   Mean   :14.988   Mean   :1.0000  
 3rd Qu.:1.320e+07   3rd Qu.:16.396   3rd Qu.:1.0939  
 Max.   :1.330e+09   Max.   :21.008   Max.   :1.4017  
                                                      
#annnd again...
#fixing Eswatini to Swaziland (even though Eswatini is the correct name)
population$country <- as.character(population$country)
population[population == "Eswatini"] <- "Swaziland"
population$country <- as.factor(population$country)

So some issues arise here with the fact that we have different country data for each dataframe. Sometimes the names are capitalized, other times there are spaces instead of underscores. By using anti_join() I can see what won’t be joined. Also this goes both ways, anti_join() returns rows of x that do not have a match in y.

isopop <- anti_join(iso, population, by = "country")
summary(isopop)
      ISO                country    four_regions
 HKG    :1   Hong Kong, China:1   africa  :0    
 TWN    :1   Taiwan          :1   americas:0    
 AFG    :0   Afghanistan     :0   asia    :2    
 AGO    :0   Albania         :0   europe  :0    
 ALB    :0   Algeria         :0                 
 AND    :0   Andorra         :0                 
 (Other):0   (Other)         :0                 
            eight_regions                   six_regions
 east_asia_pacific :2     america                 :0   
 africa_north      :0     east_asia_pacific       :2   
 africa_sub_saharan:0     europe_central_asia     :0   
 america_north     :0     middle_east_north_africa:0   
 america_south     :0     south_asia              :0   
 asia_west         :0     sub_saharan_africa      :0   
 (Other)           :0                                  
    Latitude       Longitude    
 Min.   :22.29   Min.   :114.2  
 1st Qu.:22.71   1st Qu.:115.9  
 Median :23.14   Median :117.6  
 Mean   :23.14   Mean   :117.6  
 3rd Qu.:23.57   3rd Qu.:119.3  
 Max.   :24.00   Max.   :121.0  
                                
                  World.bank.region
 East Asia & Pacific       :2      
                           :0      
 Europe & Central Asia     :0      
 Latin America & Caribbean :0      
 Middle East & North Africa:0      
 North America             :0      
 (Other)                   :0      
popiso <- anti_join(population, iso, by = "country")
summary(popiso)
                country       year       population 
 Afghanistan        :0   Min.   : NA   Min.   : NA  
 Albania            :0   1st Qu.: NA   1st Qu.: NA  
 Algeria            :0   Median : NA   Median : NA  
 Andorra            :0   Mean   :NaN   Mean   :NaN  
 Angola             :0   3rd Qu.: NA   3rd Qu.: NA  
 Antigua and Barbuda:0   Max.   : NA   Max.   : NA  
 (Other)            :0                              
    log_pop     log_pop_std 
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
                            

Went back up and solved the issue of North Macedonia and Swaziland. There are still issues with Hong Kong and Taiwan as they are part of China. ISO has data for Hong Kong and Taiwan that population does not have. I wont include them in the HTML but they are in the code.

Alright, so I will left join ISO and population then again with income? Not sure that this will solve my problem, but lets try it.

#remove all these rando tables
rm(popinc, inciso, incpop, popiso, isoinc, isopop)

#
iso <- left_join(iso, population, by = "country")
iso <- left_join(iso, income, by = c("country", "year"))

# alright 197 countries! whoop!
str(iso)
'data.frame':   8972 obs. of  15 variables:
 $ ISO              : Factor w/ 197 levels "AFG","AGO","ALB",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ country          : Factor w/ 197 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ four_regions     : Factor w/ 4 levels "africa","americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ eight_regions    : Factor w/ 8 levels "africa_north",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ six_regions      : Factor w/ 6 levels "america","east_asia_pacific",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Latitude         : num  33 33 33 33 33 33 33 33 33 33 ...
 $ Longitude        : num  66 66 66 66 66 66 66 66 66 66 ...
 $ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ year             : num  1960 1961 1962 1963 1964 ...
 $ population       : int  9000000 9170000 9350000 9540000 9740000 9960000 10200000 10400000 10600000 10900000 ...
 $ log_pop          : num  16 16 16.1 16.1 16.1 ...
 $ log_pop_std      : num  1.07 1.07 1.07 1.07 1.07 ...
 $ income           : int  2740 2700 2680 2670 2650 2640 2600 2600 2620 2590 ...
 $ log_income       : num  7.92 7.9 7.89 7.89 7.88 ...
 $ log_income_std   : num  0.917 0.915 0.914 0.914 0.913 ...
rm(income, population)

Now, join ISO to AEI with a merge (cause in this case the dynamics of merge are clearer for me). Hoping to preserve all data!

aei <- merge(aei, iso, by = c("ISO","country", "year"), all.x = TRUE)
rm(iso)
str(aei) 
'data.frame':   1848 obs. of  21 variables:
 $ ISO              : Factor w/ 232 levels "","ABW","AFG",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ country          : Factor w/ 232 levels "Afghanistan",..: 12 12 12 12 12 12 12 12 1 1 ...
 $ year             : num  1960 1970 1980 1985 1990 ...
 $ ID               : int  1 1 1 1 1 1 1 1 2 2 ...
 $ aei_ha           : num  0 0 0 0 0 ...
 $ yearcount        : num  0 10 20 25 30 35 40 45 0 10 ...
 $ area_km          : num  180 180 180 180 180 ...
 $ irrperc          : num  0 0 0 0 0 ...
 $ irrfrac          : num  0 0 0 0 0 ...
 $ four_regions     : Factor w/ 4 levels "africa","americas",..: NA NA NA NA NA NA NA NA 3 3 ...
 $ eight_regions    : Factor w/ 8 levels "africa_north",..: NA NA NA NA NA NA NA NA 5 5 ...
 $ six_regions      : Factor w/ 6 levels "america","east_asia_pacific",..: NA NA NA NA NA NA NA NA 5 5 ...
 $ Latitude         : num  NA NA NA NA NA NA NA NA 33 33 ...
 $ Longitude        : num  NA NA NA NA NA NA NA NA 66 66 ...
 $ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: NA NA NA NA NA NA NA NA 7 7 ...
 $ population       : int  NA NA NA NA NA NA NA NA 9000000 11200000 ...
 $ log_pop          : num  NA NA NA NA NA ...
 $ log_pop_std      : num  NA NA NA NA NA ...
 $ income           : int  NA NA NA NA NA NA NA NA 2740 2570 ...
 $ log_income       : num  NA NA NA NA NA ...
 $ log_income_std   : num  NA NA NA NA NA ...

Solved the issue, 232 factors for both countries and ISO, meaning that we have been using the original AEI file from Siebert as the main data frame to join to. Good. This is the data that needs to be maintained, NAs can pop up in other cols but not irrigated area.

But I still would like total GDP for later so lets transform using population.

aei$income <- as.numeric(aei$income) 

#mutating for total GDP and log GDP
aei <-
  aei %>%
  mutate(GDPtot = income*population) %>%
  mutate(log_gdp_tot = log(GDPtot)) 

#creating the df to standardize without the NAs
aei2 <-
  aei %>% 
  select(ISO, year, log_gdp_tot) %>%
  na.omit() %>% 
  mutate(log_gdp_tot_std = log_gdp_tot/mean(log_gdp_tot)) %>% 
  select(ISO, year, log_gdp_tot_std)

#now rejoin!
aei <- 
  left_join(aei, aei2, by = c("ISO", "year"))

str(aei)
'data.frame':   1848 obs. of  24 variables:
 $ ISO              : Factor w/ 232 levels "","ABW","AFG",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ country          : Factor w/ 232 levels "Afghanistan",..: 12 12 12 12 12 12 12 12 1 1 ...
 $ year             : num  1960 1970 1980 1985 1990 ...
 $ ID               : int  1 1 1 1 1 1 1 1 2 2 ...
 $ aei_ha           : num  0 0 0 0 0 ...
 $ yearcount        : num  0 10 20 25 30 35 40 45 0 10 ...
 $ area_km          : num  180 180 180 180 180 ...
 $ irrperc          : num  0 0 0 0 0 ...
 $ irrfrac          : num  0 0 0 0 0 ...
 $ four_regions     : Factor w/ 4 levels "africa","americas",..: NA NA NA NA NA NA NA NA 3 3 ...
 $ eight_regions    : Factor w/ 8 levels "africa_north",..: NA NA NA NA NA NA NA NA 5 5 ...
 $ six_regions      : Factor w/ 6 levels "america","east_asia_pacific",..: NA NA NA NA NA NA NA NA 5 5 ...
 $ Latitude         : num  NA NA NA NA NA NA NA NA 33 33 ...
 $ Longitude        : num  NA NA NA NA NA NA NA NA 66 66 ...
 $ World.bank.region: Factor w/ 8 levels "","East Asia & Pacific",..: NA NA NA NA NA NA NA NA 7 7 ...
 $ population       : int  NA NA NA NA NA NA NA NA 9000000 11200000 ...
 $ log_pop          : num  NA NA NA NA NA ...
 $ log_pop_std      : num  NA NA NA NA NA ...
 $ income           : num  NA NA NA NA NA NA NA NA 2740 2570 ...
 $ log_income       : num  NA NA NA NA NA ...
 $ log_income_std   : num  NA NA NA NA NA ...
 $ GDPtot           : num  NA NA NA NA NA ...
 $ log_gdp_tot      : num  NA NA NA NA NA ...
 $ log_gdp_tot_std  : num  NA NA NA NA NA ...

Finally, lets check out a quick histogram of GDP and pop,

hist(aei$log_gdp_tot_std, breaks = 100)

hist(aei$log_pop_std, breaks = 100)

Dealing with Some Issues

There are a lot of countries that have no data for the regions, but have non zero irrperc/frac which I will use in further analysis, so we need to replace the regions.

#view which countries don't have regions but have non zero irrprec
na_regions <- aei[is.na(aei$four_regions),]

#Aruba
aei <- aei %>%
  mutate(four_regions = replace(four_regions, ISO == 'ABW', 'americas'), 
         six_regions = replace(six_regions, ISO == 'ABW', 'america'), 
         eight_regions = replace(eight_regions, ISO == 'ABW', 'america_north')) %>%
#american samoa
  mutate(four_regions = replace(four_regions,ISO == 'ASM', 'asia'), 
         six_regions = replace(six_regions,ISO == 'ASM', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'ASM', 'east_asia_pacific')) %>%
#bermuda
  mutate(four_regions = replace(four_regions,ISO == 'BMU', 'americas'), 
         six_regions = replace(six_regions,ISO == 'BMU', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'BMU', 'america_north')) %>% 
#cote d'ivoire
  mutate(four_regions = replace(four_regions,ISO == 'CIV', 'africa'), 
         six_regions = replace(six_regions,ISO == 'CIV', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'CIV', 'africa_sub_saharan')) %>% 
#democratic republic of congo
  mutate(four_regions = replace(four_regions,ISO == 'COD', 'africa'), 
         six_regions = replace(six_regions,ISO == 'COD', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'COD', 'africa_sub_saharan')) %>% 
#republic of congo
  mutate(four_regions = replace(four_regions,ISO == 'COG', 'africa'), 
         six_regions = replace(six_regions,ISO == 'COG', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'COG', 'africa_sub_saharan')) %>% 
#cayman islands
  mutate(four_regions = replace(four_regions,ISO == 'CYM', 'americas'), 
         six_regions = replace(six_regions,ISO == 'CYM', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'CYM', 'america_north')) %>%
#Faroe islands
  mutate(four_regions = replace(four_regions,ISO == 'FRO', 'europe'), 
         six_regions = replace(six_regions,ISO == 'FRO', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'FRO', 'europe_west')) %>%
#micronesia
  mutate(four_regions = replace(four_regions,ISO == 'FSM', 'asia'), 
         six_regions = replace(six_regions,ISO == 'FSM', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'FSM', 'east_asia_pacific')) %>%
#gibraltar
  mutate(four_regions = replace(four_regions,ISO == 'GIB', 'europe'), 
         six_regions = replace(six_regions,ISO == 'GIB', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'GIB', 'europe_west')) %>%
#Guadeloupe
  mutate(four_regions = replace(four_regions,ISO == 'GLP', 'americas'), 
         six_regions = replace(six_regions,ISO == 'GLP', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'GLP', 'america_north')) %>%
#greenland
  mutate(four_regions = replace(four_regions,ISO == 'GRL', 'europe'), 
         six_regions = replace(six_regions,ISO == 'GRL', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'GRL', 'europe_west')) %>%
#french Guiana 
  mutate(four_regions = replace(four_regions,ISO == 'GUF', 'americas'), 
         six_regions = replace(six_regions,ISO == 'GUF', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'GUF', 'america_south')) %>%
#guam
  mutate(four_regions = replace(four_regions,ISO == 'GUM', 'asia'), 
         six_regions = replace(six_regions,ISO == 'GUM', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'GUM', 'east_asia_pacific')) %>%
#Kyrgyzstan
  mutate(four_regions = replace(four_regions,ISO == 'KGZ', 'asia'), 
         six_regions = replace(six_regions,ISO == 'KGZ', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'KGZ', 'asia_west')) %>%
#laos
  mutate(four_regions = replace(four_regions,ISO == 'LAO', 'asia'), 
         six_regions = replace(six_regions,ISO == 'LAO', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'LAO', 'east_asia_pacific')) %>%
#Macedonia
  mutate(four_regions = replace(four_regions,ISO == 'MKD', 'europe'), 
         six_regions = replace(six_regions,ISO == 'MKD', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'MKD', 'europe_east')) %>%
#Martinique 
  mutate(four_regions = replace(four_regions,ISO == 'MTQ', 'americas'), 
         six_regions = replace(six_regions,ISO == 'MTQ', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'MTQ', 'america_north')) %>%
#new caledonia
  mutate(four_regions = replace(four_regions,ISO == 'NCL', 'asia'), 
         six_regions = replace(six_regions,ISO == 'NCL', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'NCL', 'east_asia_pacific')) %>%
#puerto rico
 mutate(four_regions = replace(four_regions,ISO == 'PRI', 'americas'), 
         six_regions = replace(six_regions,ISO == 'PRI', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'PRI', 'america_north')) %>%
#Palestine (West Bank + Gaza strip)
 mutate(four_regions = replace(four_regions,ISO == 'PSE', 'asia'), 
         six_regions = replace(six_regions,ISO == 'PSE', 'middle_east_north_africa'), 
         eight_regions = replace(eight_regions,ISO == 'PSE', 'asia_west')) %>%
#french Polynesia
  mutate(four_regions = replace(four_regions,ISO == 'PYF', 'asia'), 
         six_regions = replace(six_regions,ISO == 'PYF', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'PYF', 'east_asia_pacific')) %>%
#Reunion
  mutate(four_regions = replace(four_regions,ISO == 'REU', 'africa'), 
         six_regions = replace(six_regions,ISO == 'REU', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'REU', 'africa_sub_saharan')) %>%
#Serbia, Montenegro, Kosovo
  mutate(four_regions = replace(four_regions,ISO == 'SMK', 'europe'), 
         six_regions = replace(six_regions,ISO == 'SMK', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'SMK', 'europe_east')) %>%
#Slovakia
  mutate(four_regions = replace(four_regions,ISO == 'SVK', 'europe'), 
         six_regions = replace(six_regions,ISO == 'SVK', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'SVK', 'europe_east')) %>%
#Turks and Caicos Islands
  mutate(four_regions = replace(four_regions,ISO == 'TCA', 'americas'), 
         six_regions = replace(six_regions,ISO == 'TCA', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'TCA', 'america_north')) %>%
#East Timor
  mutate(four_regions = replace(four_regions,ISO == 'TLS', 'asia'), 
         six_regions = replace(six_regions,ISO == 'TLS', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'TLS', 'east_asia_pacific')) %>%
#taiwan
  mutate(four_regions = replace(four_regions,ISO == 'TWN', 'asia'), 
         six_regions = replace(six_regions,ISO == 'TWN', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'TWN', 'east_asia_pacific')) %>%
#British virgin Islands
  mutate(four_regions = replace(four_regions,ISO == 'VGB', 'americas'), 
         six_regions = replace(six_regions,ISO == 'VGB', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'VGB', 'america_north')) %>%
#US Virgin Islands
  mutate(four_regions = replace(four_regions,ISO == 'VIR', 'americas'), 
         six_regions = replace(six_regions,ISO == 'VIR', 'america'), 
         eight_regions = replace(eight_regions, ISO == 'VIR', 'america_north')) 

Next time I’ll write a function.

There are also a lot of countries that have NA values for their income or population. Take a peek here.

lotanas <-  aei[rowSums(is.na(aei)) > 0,]
lotanas %>% 
  group_by(country) %>% 
  select(income, population) %>%  
  summarise_all(funs(sum(is.na(.))))
NA
NA

The countries that don’t have income also don’t have population, so as least I am consistent in my missing data. I have to move on so I will just leave this here for a minute. Perhaps later we should fill these variables.

Precipitation

Again, precipitation is translated output from LPJmL. Precipitation was chosen because it could be summed across countries, as other data could not (like discharge or ET). I have summed the precipitation by country and year. I am not sure whether I should scale or standardize precip, I will do all the possible combinations at the moment.

#precipitation
#cftandprecip is a script with all the grid cell values. 
cftandprecip <- read.csv('/Volumes/RachelExternal/Thesis/Thesis/Data/cft_and_precipdata.csv')
cftandprecip$ISO <- as.factor(cftandprecip$ISO)
cftandprecip <- cftandprecip[, -c(1)] #remove id col
cftandprecip <- subset(cftandprecip, year >= 1960)

#log and scale
precip <- 
  cftandprecip %>% 
  select(ISO, year, precipmm) %>% 
  group_by(ISO, year) %>% 
  summarise(preciptot = (sum(precipmm))) %>% #sum grid cells by ISO and year
  mutate(precip_sc = preciptot/max(preciptot)) %>% 
  mutate(precip_std = preciptot/mean(preciptot)) %>% 
  mutate(log_precip = log(preciptot+1)) %>% #1 is added here to prevent -Inf
  mutate(log_precip_sc = log_precip/max(log_precip)) %>% 
  mutate(log_precip_std = log_precip/mean(log_precip))

#merge
aei <- merge(aei, precip, by = c("ISO", "year"), all.x = TRUE)
rm(precip)

And some histograms to check out the scaled/standardized data.

hist(aei$preciptot, breaks = 100)

hist(aei$precip_sc, breaks = 100)

hist(aei$precip_std, breaks = 100)

hist(aei$log_precip, breaks = 100)

hist(aei$log_precip_sc, breaks = 100)

hist(aei$log_precip_std, breaks = 100)

Crop Fraction

Uff. Then comes crop fraction. As mentioned above individual crop fraction per grid cell is multiplied by respective grid cell area, summed on a country level giving ha of each crop grown per country. This was then divided by total area of the country to achieve the % area occupied by a certain crop per country.

#pull cft frac outta here
cft <- 
  cftandprecip %>% 
  select(-c(4)) %>% #removing precip
  mutate_each(funs(.*cellaream2), c(4:35)) %>% #multiply all fractions with the cell area
  group_by(ISO, year) %>% #group in to country and year
  summarise_each(funs(sum)) %>% #take col sums for each crop
  mutate(across(c(3:34), .fns = ~./cellaream2)) #divide col sums by total country area 

#removing categories with 0 (the biofuels and such)
cft <- cft[, -c(18, 19, 26, 34, 35)]

#and a merge
aei <- merge(aei, cft, by = c("ISO", "year"), all.x = TRUE)
rm(cft, cftandprecip)

Idk how to show the data here. Any ideas on representation would be helpful here.

Topographical

Topographical data is represented here by the data set from Riley, Degloria, and Elliot (1999) and is conviently available through the library(rethinking) from McElreath (2015). It provides a ruggedness factor for each country.

#rugged
data("rugged")
rug <- rugged %>% select(c("isocode", "rugged")) 
colnames(rug) <- c("ISO", "rugged")
#scaling
rug$rugged_sc <- rug$rugged / max(rug$rugged)
#merge
aei <- merge(aei, rug, by = "ISO", all.x = TRUE)
rm(rug, rugged)
hist(aei$rugged_sc, breaks = 100)


Time Series Evolution of Country Level Target Variable

Ok, so lets peek at how things change over time. Here is a graph of percentage of country area equipped for irrigation from 1960 - 2005.


pirrfrac <-
  ggplot(data = subset(aei, !is.na(irrperc)), 
         mapping = aes(x=year, y= irrperc, group = country, color = six_regions)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrfrac)
NA
NA

Although this graph has a lot of countries and can get quite messy, some general trends appear. South Asian countries seem to have high percentages of irrigated area per total land area (irrigation percentage). Then seemingly followed by East Asian and European countries. Africa and the Middle East do not seem to be irrigated as much. Be aware that you can these graphs are interactive, you may zoom, pan or click on the legend to isolate the regions. Although, a regional breakdown follows.

Sub-Saharan Africa

In general, Africa has low levels of irrigated area. Swaziland, Sao Tome and Principe, Reunion and Mauritius lead here with majority of others falling below the levels of these countries.

Europe and Centrtal Asia

Europe contains some countries that reach higher percentages of irrigated area. Countries such as the Netherlands, Romania, Albania, and Azerbaijan have reached peaks in the past (roughly 10-15% irrigated area) but seem to be in a decline towards the end our study period. Other countries such as Italy and Denmark have had more stable rises with less down turn toward. Unexpectedly, Russia has a very low percentage of irrigated area, however at this point, the assumption is that either Russia’s enormous size cancels out its irrigated area, or that Russia receives enough precipitation for the crops it grows that irrigation is not necessary. Further analysis should reveal this.

Americas

In the Americas, the % of area irrigated by countries is lower than Asia or Europe. Peaks here are Cuba and the Dominican Republic at around roughly 6-8% of their total countries being irrigated. There are some islands (Barbados and St. Lucia) that have sharp increases in irrigation fraction, but this could be because they are islands and any irrigation at all would induce the ration of irrigated land to total land spike. Counties that have experienced a steady increase, but are not the leaders in overall irrigation fraction are countries that are big and produce a lot of produce (USA, Mexico, Chile) The general trends of irrigation increase seem to be upward with perhaps a slight leveling off toward the end of the study period (for some countries) but lacking the distinct downturn that is seen in Europe.

East Asia and Oceania

The East Asian countries are quite high in % of irrigated area, when compared to the Americas or Africa. The four countries that reach upwards of (roughly) 10% or more of their total area irrigated toward the end of the study period are Vietnam, North Korea, South Korea, and Thailand. Other countries that have high irrfracs are China, Japan and the Philippines. It is expected that for this region the irrigation fraction would be high due to the amount of rice produced.

Oceania, due to its large amount of islands, does not have a lot of irrigation, majority of countries have 0% irrigation. New Zealand (around 2%) and Australia (less than 1%) have some irrigation. Although this could have a multitude of reasons why irrigation is so low here, regardless of the fact that Australia and New Zealand are very populous countries with mouths to feed. Australia is very large and perhaps its enormity cancels out the irrigated land it does have.

South Asia

Very few countries here, countries in this area are large. Very high levels of irrigation here coming from most countries. Bangladesh, Pakistan, and India are heavily irrigated compared to their land area.

Middle East and North Africa

Irrigation is also prevalent here (but not as much as in south Asia). These countries are dry and therefore would need irrigation to have domestic food supplies. Lebanon, Iraq and Israel lead here. Oil producing countries have little irrigation, they are dry and their GDP is high, perhaps importation is the easiest option for them?


Overall, this might be a good overview of what irrperc looks like for each region.


theme_set(theme_minimal())

ggplot(subset(aei, !is.na(irrperc)),
       aes(x=six_regions, y=irrperc, fill=six_regions)) + 
  geom_boxplot(show.legend = FALSE)  + 
  coord_flip() +
  theme(legend.title = element_blank())


References

McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC press. https://xcelab.net/rm/statistical-rethinking/.
Riley, Shawn, Stephen Degloria, and S. D. Elliot. 1999. “A Terrain Ruggedness Index That Quantifies Topographic Heterogeneity.” Internation Journal of Science 5 (January): 23–27.
Siebert, Stefan, Matti Kummu, Miina Porkka, Petra Doell, Navin Ramankutty, and Bridget Scanlon. 2015. “A Global Data Set of the Extent of Irrigated Land from 1900 to 2005.” Hydrology and Earth System Sciences 19 (March): 1521–45. https://doi.org/10.5194/hess-19-1521-2015.
Walker, W. R. 1989. FAO Irrigation and Drainage Paper 45.” 1. Vol. 45. Rome.
---
title: "2 Country Level Data Cleaning"
output: 
  html_document:
  toc: yes
  toc_float: yes
bibliography: /Volumes/RachelExternal/Thesis/Thesis/bib.bib
---

```{r Housekeeping, warning=FALSE, message=FALSE, include=FALSE}
library(brms)
library(rstan)
library(dplyr)
library(tidyr)
library(plotly)
library(ggplot2)
library(rethinking)
library(purrr)
library(gganimate)
library(wbstats)
library(viridis)
library(tm)
library(wbstats)
library(tibble)
library(stringr)
library(distill)
```



Lets do a short introduction to what we're working with here. 

# Target Variable  
  
* $Y$ - Irrigation Percentage - Downloaded the data from @siebertGlobalDataSet2015 data set in ha/country/timestep, then by dividing by the countries total area, the fraction of total land area that is irrigated in a given country is created. 
```{r Setting up Country Level df, warning=FALSE}
#original Siebert Data
aei <- read.csv2("/Volumes/RachelExternal/Thesis/Data/SiebertData/HID_v10/Country_tenyear_ha.csv")
aei <- as.data.frame(aei)

#rename the cols so that I can use them as number
colnames(aei) <- c("country", "ISO", "ID", "1900", "1910","1920","1930","1940","1950","1960","1970","1980","1985","1990","1995","2000","2005")

#tidying data to long format
aei <- pivot_longer(aei,cols = 4:17, names_to = "year", values_to = "aei_ha")
aei$aei_ha <- removePunctuation(aei$aei_ha)
aei$year <- as.numeric(aei$year)
aei$aei_ha <- as.numeric(aei$aei_ha)
aei$ISO <- as.factor(aei$ISO)
aei$country <- as.factor(aei$country)
#subset the data so that its from the year 1960
aei <- aei %>% subset(year >= 1960)

#make a new col for year count
aei$yearcount <- (aei$year - 1960)

#remove data for the total global irrigated area
globeaei <- subset(aei, country == "WORLD")
aei= aei[!aei$country == "WORLD",]

str(aei)
```

      
Added to this to be able to calculate the % of land irrigated per country we need the country area. This data is taken from Worldbank.  

```{r message=FALSE, warning=FALSE}
#import the data set with total country area, this comes from the Worldbank
countryarea <- read.csv2("/Volumes/RachelExternal/Thesis/Data/LandArea/LandArea.csv")

#removing extra cols
countryarea <- countryarea[-c(265:270), ]
countryarea <- countryarea[, -c(1, 3:4)]

#fix the col names (in the two days since i found this little for loop I learned how to do this faster!)
for ( col in 2:ncol(countryarea)){
    colnames(countryarea)[col] <-  sub("X", "", colnames(countryarea)[col])
}
remove(col)

#making long data
countryarea <- 
  countryarea %>%  
  pivot_longer(!Country.Code, names_to = "year", values_to = "area_km") %>% 
  rename(ISO = Country.Code)

#fixing variable types
countryarea$ISO <- as.factor(countryarea$ISO)
countryarea$year <- as.numeric(countryarea$year)
str(countryarea)
```


For some reason, the data from Worldbank on each countries area only starts in 1961. For this analysis, 1960 is needed. Country area from 1961 will be substituted for 1960, with the assumption that country area doesn't change drastically in one year (it doesn't).  

```{r warning = FALSE}
head(subset(countryarea, c(year == 1960 | year == 1961))) 

#fixing 1960
countryarea <-
  countryarea %>% 
  group_by(ISO) %>% 
  fill(area_km, .direction = "up") %>% 
  ungroup()

```

Merge the country area with the AEI df here.   

```{r Join CAr }
#combine these data sets
aei <- aei %>% merge(y = countryarea, by.x = c("ISO", "year"), all.x = TRUE)

rm(countryarea)
head(aei)
```

Some countries also have no area value. I'll go through and see if any of the countries with missing area values have non zero AEI_ha. If so, I'll replace them, as they didn't come through with the WorldBank data set but are online. Copying and pasting from various sources should be fine.


```{r Fixes for CAr}
#checking for countries with NA values in their area_km col
#looking primarily for countries that have non 0 AEI_ha
aei[is.na(aei$area_km),]
                
#fixing counties with no area
aei$area_km[which(aei$ISO == "GUF")] <- 83534 #French Guiana from Britannica
aei$area_km[which(aei$ISO == "GLP")] <- 1630 #Guadeloupe from Britannica 
aei$area_km[which(aei$ISO == "MTQ")] <- 1128 # Martinique from Britannica
aei$area_km[which(aei$ISO == "REU")] <- 2512 #Reunion from Britannica
aei$area_km[which(aei$ISO == "SMK")] <- 13812 + 10905 + 77589 #Montenegro + Kosovo + Serbia from Britannica
aei$area_km[which(aei$ISO == "TWN")] <- 35980 #Taiwan from CIA.gov
aei$area_km[which(aei$ISO == "SDN")] <- 1882000 #Sudan from UNDP
```


Now, transform the AEI_ha to $km^2$ and divide to get the % of total country area that is equipped for irrigation. There are still some countries that have 0 AEI_ha and no area_km value. I think I will assign the irrfrac/perc 0 to those countries, but lets wait a little while to do that, most are islands.  

```{r Creating Target Variable}
aei <- aei %>% mutate(irrperc = ((aei_ha/100)/area_km)*100, irrfrac = ((aei_ha/100)/area_km))
```

Here is a quick histogram of the target variable. A lot of 0s. 

```{r Hist of TVar}
hist(aei$irrperc, breaks = 200)
```

***

# Chosen Predictor Variables  

According to the FAO [@walkerFAOIrrigationDrainage1989], the choice of which irrigation, and by proxy whether to irrigate or not, comes from six factors. The predictor that will be used for this analysis are listed below:  

- Economic 

  - $M_c$ - GDP per Capita - Each country's GDP per Capita from Gapminder, which collects data from the UN and the Maddison project. This variable will be logged because as stated in *Statistical Rethinking* [@mcelreathStatisticalRethinkingBayesian2015] McElreath states (pg. 239), "We use the logarithm of it, because the logarithm of GDP is the *magnitude* of GDP. Since wealth generates wealth, it tends to be exponentially related to anything that increases it"
  
- Topographical   

  - $Ru_c$ - Represented here using the "rugged" data set from rethinking, in which a coeff of ruggedness for each country was assigned. This data was taken from @rileyTerrainRuggednessIndex1999
  
- Soil Type  

  - This will not be included in this analysis. Although, it could be useful there is no good way to quantify soil type on a country level basis. This will be included in the grid cell model, as regions can be easily integrated with the grid cell structure. 

- Water Supply    

  - $R_c$ - Total Precipitation -  Translated output from LPJmL. Summed across two dimensions: month and country. In mm/country/year  
- Crop Type  

  - Fraction of each crop grown per country -  Translated output from LPJmL. Individual crop fraction per grid cell multiplied by respective grid cell area, summed on a country level giving ha of each crop grown per country. This was then divided by total area of the country to achieve the % area occupied by a certain crop per country. Still working out how to work with this predictor. 

- Social Influences  

  - $P_c$ - Total Population - Each country's total population on a yearly basis. Taken from Gapminder. Again, this will be logged, due to the same argument for GDP, and centered.
  
All variables will be scaled or centered. 


First, the ISO codes and Regions to be able to hold some structure.  

```{r}
#the ISO codes and regional data from Gapminder
iso <- read.csv2("/Volumes/RachelExternal/Thesis/Thesis/Data/GAPminder_ISOcodes.csv")
iso <- iso[, -c(2,7,10,12,13)] #removing extra info

#renaming some cols
iso <- 
  iso %>% 
  rename(ISO = GEO, country = name)

#fixing the variable types
iso$ISO <- as.factor(iso$ISO)
iso$country <- as.factor(iso$country)
iso$four_regions <- as.factor(iso$four_regions)
iso$eight_regions <- as.factor(iso$eight_regions)
iso$six_regions <- as.factor(iso$six_regions)
iso$six_regions <- as.factor(iso$six_regions)
iso$World.bank.region <- as.factor(iso$World.bank.region)
str(iso)

#past me helping future me
#fixing Macedonia, FYR to North Macedonia
iso$country <- as.character(iso$country)
iso[iso == "Macedonia, FYR"] <- "North Macedonia"
iso$country <- as.factor(iso$country)
```
Cool, we have multiple regions here that we could use in the future for some multilevel business. Also, latitude and longitude is here, which we can also try later. Obviously these lat and lon measurements will mean nothing for a country like China or the USA, but they are nice to have.    


## Income

Income data is coming from Gapminder, as they have put real effort in to filling all the gaps in income and population for the last 150 years. Income is in terms of the 2011 USD and is inflation adjusted.  

```{r}
#income here is 2011 usd 
income <- read.csv("/Volumes/RachelExternal/Thesis/Thesis/Data/Gapminder_income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
income <- income[, c(1, 162:207)] #removing extra years

#making long data
income <- 
  income %>%  
  pivot_longer(!country, names_to = "year", values_to = "income") 

#still have the xs there
income$year <- str_remove_all(income$year, "[X]")
#transformation of the variable types
income$year <- as.numeric(income$year)
income$country <- as.factor(income$country)

#log and standardize
income$log_income <- log(income$income)
income <- income %>% 
  mutate(log_income_std = log_income / mean(log_income))

#a summary
summary(income)

#past me helping future me again
#fixing Eswatini to Swaziland (even though Eswatini is the correct name)
income$country <- as.character(income$country)
income[income == "Eswatini"] <- "Swaziland"
income$country <- as.factor(income$country)

```

Logging and standardizing is necessary here. Again, the argument is that wealth begets wealth, so income is logarithmic. From there just center the mean at 0 with a stdev of 1 for standardization.  



## Population

Population data is also taken from Gapminder. The same process of logging and standardizing will be carried out here. 

```{r}
#population data also taken from Gapminder
population <- read.csv("/Volumes/RachelExternal/Thesis/Thesis/Data/Gapminder_population_total.csv")

#removing extra years
population <- population[, c(1, 162:207)] 
#making longer data
population <- 
  population %>%  
  pivot_longer(!country, names_to = "year", values_to = "population") 

#removing the Xs from the year strings
population$year <- str_remove_all(population$year, "[X]")

#fixing variable types
population$country <- as.factor(population$country)
population$year <- as.numeric(population$year)

#log and standardize, same argument as income: population begets population
population$log_pop <- log(population$population)
population <- population %>% 
  mutate(log_pop_std = log_pop / mean(log_pop))

summary(population)

#annnd again...
#fixing Eswatini to Swaziland (even though Eswatini is the correct name)
population$country <- as.character(population$country)
population[population == "Eswatini"] <- "Swaziland"
population$country <- as.factor(population$country)
```

So some issues arise here with the fact that we have different country data for each dataframe. Sometimes the names are capitalized, other times there are spaces instead of underscores. By using `anti_join()` I can see what won't be joined. Also this goes both ways, `anti_join()` returns rows of x that do not have a match in y.  

```{r}
isopop <- anti_join(iso, population, by = "country")
summary(isopop)

popiso <- anti_join(population, iso, by = "country")
summary(popiso)
```

Went back up and solved the issue of North Macedonia and Swaziland. There are still issues with Hong Kong and Taiwan as they are part of China. ISO has data for Hong Kong and Taiwan that population does not have.  I wont include them in the HTML but they are in the code.  

```{r include=FALSE}
isoinc <- anti_join(iso, income, by = "country")
summary(isoinc)

inciso <- anti_join(income, iso, by = "country")
summary(inciso)
```


```{r include=FALSE}
popinc <- anti_join(population, income, by = "country")
summary(popinc)

incpop <- anti_join(income, population, by = "country")
summary(incpop)
```


Alright, so I will left join ISO and population then again with income? Not sure that this will solve my problem, but lets try it.
```{r}
#remove all these rando tables
rm(popinc, inciso, incpop, popiso, isoinc, isopop)

#
iso <- left_join(iso, population, by = "country")
iso <- left_join(iso, income, by = c("country", "year"))

# alright 197 countries! whoop!
str(iso)

rm(income, population)
```

Now, join ISO to AEI with a merge (cause in this case the dynamics of merge are clearer for me). Hoping to preserve all data! 

```{r}
aei <- merge(aei, iso, by = c("ISO","country", "year"), all.x = TRUE)
rm(iso)
str(aei) 
```

Solved the issue, 232 factors for both countries and ISO, meaning that we have been using the original AEI file from Siebert as the main data frame to join to. Good. This is the data that needs to be maintained, NAs can pop up in other cols but not irrigated area.   

But I still would like total GDP for later so lets transform using population. 
```{r}
aei$income <- as.numeric(aei$income) 

#mutating for total GDP and log GDP
aei <-
  aei %>%
  mutate(GDPtot = income*population) %>%
  mutate(log_gdp_tot = log(GDPtot)) 

#creating the df to standardize without the NAs
aei2 <-
  aei %>% 
  select(ISO, year, log_gdp_tot) %>%
  na.omit() %>% 
  mutate(log_gdp_tot_std = log_gdp_tot/mean(log_gdp_tot)) %>% 
  select(ISO, year, log_gdp_tot_std)

#now rejoin!
aei <- 
  left_join(aei, aei2, by = c("ISO", "year"))

str(aei)
```


Finally, lets check out a quick histogram of GDP and pop,
```{r}
hist(aei$log_gdp_tot_std, breaks = 100)
```

```{r}
hist(aei$log_pop_std, breaks = 100)
```


## Dealing with Some Issues

There are a lot of countries that have no data for the regions, but have non zero irrperc/frac which I will use in further analysis, so we need to replace the regions. 

```{r}
#view which countries don't have regions but have non zero irrprec
na_regions <- aei[is.na(aei$four_regions),]

#Aruba
aei <- aei %>%
  mutate(four_regions = replace(four_regions, ISO == 'ABW', 'americas'), 
         six_regions = replace(six_regions, ISO == 'ABW', 'america'), 
         eight_regions = replace(eight_regions, ISO == 'ABW', 'america_north')) %>%
#american samoa
  mutate(four_regions = replace(four_regions,ISO == 'ASM', 'asia'), 
         six_regions = replace(six_regions,ISO == 'ASM', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'ASM', 'east_asia_pacific')) %>%
#bermuda
  mutate(four_regions = replace(four_regions,ISO == 'BMU', 'americas'), 
         six_regions = replace(six_regions,ISO == 'BMU', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'BMU', 'america_north')) %>% 
#cote d'ivoire
  mutate(four_regions = replace(four_regions,ISO == 'CIV', 'africa'), 
         six_regions = replace(six_regions,ISO == 'CIV', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'CIV', 'africa_sub_saharan')) %>% 
#democratic republic of congo
  mutate(four_regions = replace(four_regions,ISO == 'COD', 'africa'), 
         six_regions = replace(six_regions,ISO == 'COD', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'COD', 'africa_sub_saharan')) %>% 
#republic of congo
  mutate(four_regions = replace(four_regions,ISO == 'COG', 'africa'), 
         six_regions = replace(six_regions,ISO == 'COG', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'COG', 'africa_sub_saharan')) %>% 
#cayman islands
  mutate(four_regions = replace(four_regions,ISO == 'CYM', 'americas'), 
         six_regions = replace(six_regions,ISO == 'CYM', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'CYM', 'america_north')) %>%
#Faroe islands
  mutate(four_regions = replace(four_regions,ISO == 'FRO', 'europe'), 
         six_regions = replace(six_regions,ISO == 'FRO', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'FRO', 'europe_west')) %>%
#micronesia
  mutate(four_regions = replace(four_regions,ISO == 'FSM', 'asia'), 
         six_regions = replace(six_regions,ISO == 'FSM', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'FSM', 'east_asia_pacific')) %>%
#gibraltar
  mutate(four_regions = replace(four_regions,ISO == 'GIB', 'europe'), 
         six_regions = replace(six_regions,ISO == 'GIB', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'GIB', 'europe_west')) %>%
#Guadeloupe
  mutate(four_regions = replace(four_regions,ISO == 'GLP', 'americas'), 
         six_regions = replace(six_regions,ISO == 'GLP', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'GLP', 'america_north')) %>%
#greenland
  mutate(four_regions = replace(four_regions,ISO == 'GRL', 'europe'), 
         six_regions = replace(six_regions,ISO == 'GRL', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'GRL', 'europe_west')) %>%
#french Guiana 
  mutate(four_regions = replace(four_regions,ISO == 'GUF', 'americas'), 
         six_regions = replace(six_regions,ISO == 'GUF', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'GUF', 'america_south')) %>%
#guam
  mutate(four_regions = replace(four_regions,ISO == 'GUM', 'asia'), 
         six_regions = replace(six_regions,ISO == 'GUM', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'GUM', 'east_asia_pacific')) %>%
#Kyrgyzstan
  mutate(four_regions = replace(four_regions,ISO == 'KGZ', 'asia'), 
         six_regions = replace(six_regions,ISO == 'KGZ', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'KGZ', 'asia_west')) %>%
#laos
  mutate(four_regions = replace(four_regions,ISO == 'LAO', 'asia'), 
         six_regions = replace(six_regions,ISO == 'LAO', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'LAO', 'east_asia_pacific')) %>%
#Macedonia
  mutate(four_regions = replace(four_regions,ISO == 'MKD', 'europe'), 
         six_regions = replace(six_regions,ISO == 'MKD', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'MKD', 'europe_east')) %>%
#Martinique 
  mutate(four_regions = replace(four_regions,ISO == 'MTQ', 'americas'), 
         six_regions = replace(six_regions,ISO == 'MTQ', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'MTQ', 'america_north')) %>%
#new caledonia
  mutate(four_regions = replace(four_regions,ISO == 'NCL', 'asia'), 
         six_regions = replace(six_regions,ISO == 'NCL', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'NCL', 'east_asia_pacific')) %>%
#puerto rico
 mutate(four_regions = replace(four_regions,ISO == 'PRI', 'americas'), 
         six_regions = replace(six_regions,ISO == 'PRI', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'PRI', 'america_north')) %>%
#Palestine (West Bank + Gaza strip)
 mutate(four_regions = replace(four_regions,ISO == 'PSE', 'asia'), 
         six_regions = replace(six_regions,ISO == 'PSE', 'middle_east_north_africa'), 
         eight_regions = replace(eight_regions,ISO == 'PSE', 'asia_west')) %>%
#french Polynesia
  mutate(four_regions = replace(four_regions,ISO == 'PYF', 'asia'), 
         six_regions = replace(six_regions,ISO == 'PYF', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'PYF', 'east_asia_pacific')) %>%
#Reunion
  mutate(four_regions = replace(four_regions,ISO == 'REU', 'africa'), 
         six_regions = replace(six_regions,ISO == 'REU', 'sub_saharan_africa'), 
         eight_regions = replace(eight_regions,ISO == 'REU', 'africa_sub_saharan')) %>%
#Serbia, Montenegro, Kosovo
  mutate(four_regions = replace(four_regions,ISO == 'SMK', 'europe'), 
         six_regions = replace(six_regions,ISO == 'SMK', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'SMK', 'europe_east')) %>%
#Slovakia
  mutate(four_regions = replace(four_regions,ISO == 'SVK', 'europe'), 
         six_regions = replace(six_regions,ISO == 'SVK', 'europe_central_asia'), 
         eight_regions = replace(eight_regions,ISO == 'SVK', 'europe_east')) %>%
#Turks and Caicos Islands
  mutate(four_regions = replace(four_regions,ISO == 'TCA', 'americas'), 
         six_regions = replace(six_regions,ISO == 'TCA', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'TCA', 'america_north')) %>%
#East Timor
  mutate(four_regions = replace(four_regions,ISO == 'TLS', 'asia'), 
         six_regions = replace(six_regions,ISO == 'TLS', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'TLS', 'east_asia_pacific')) %>%
#taiwan
  mutate(four_regions = replace(four_regions,ISO == 'TWN', 'asia'), 
         six_regions = replace(six_regions,ISO == 'TWN', 'east_asia_pacific'), 
         eight_regions = replace(eight_regions,ISO == 'TWN', 'east_asia_pacific')) %>%
#British virgin Islands
  mutate(four_regions = replace(four_regions,ISO == 'VGB', 'americas'), 
         six_regions = replace(six_regions,ISO == 'VGB', 'america'), 
         eight_regions = replace(eight_regions,ISO == 'VGB', 'america_north')) %>%
#US Virgin Islands
  mutate(four_regions = replace(four_regions,ISO == 'VIR', 'americas'), 
         six_regions = replace(six_regions,ISO == 'VIR', 'america'), 
         eight_regions = replace(eight_regions, ISO == 'VIR', 'america_north')) 

```
Next time I'll write a function.

There are also a lot of countries that have NA values for their income or population. Take a peek here.

```{r message=FALSE, warning=FALSE}
lotanas <-  aei[rowSums(is.na(aei)) > 0,]
lotanas %>% 
  group_by(country) %>% 
  select(income, population) %>%  
  summarise_all(funs(sum(is.na(.))))


```


The countries that don't have income also don't have population, so as least I am consistent in my missing data. I have to move on so I will just leave this here for a minute. Perhaps later we should fill these variables.  



## Precipitation

Again, precipitation is translated output from LPJmL. Precipitation was chosen because it could be summed across countries, as other data could not (like discharge or ET). I have summed the precipitation by country and year. I am not sure whether I should scale or standardize precip, I will do all the possible combinations at the moment. 
```{r Precip Manipulation, message=FALSE, warning=FALSE}
#precipitation
#cftandprecip is a script with all the grid cell values. 
cftandprecip <- read.csv('/Volumes/RachelExternal/Thesis/Thesis/Data/cft_and_precipdata.csv')
cftandprecip$ISO <- as.factor(cftandprecip$ISO)
cftandprecip <- cftandprecip[, -c(1)] #remove id col
cftandprecip <- subset(cftandprecip, year >= 1960)

#log and scale
precip <- 
  cftandprecip %>% 
  select(ISO, year, precipmm) %>% 
  group_by(ISO, year) %>% 
  summarise(preciptot = (sum(precipmm))) %>% #sum grid cells by ISO and year
  mutate(precip_sc = preciptot/max(preciptot)) %>% 
  mutate(precip_std = preciptot/mean(preciptot)) %>% 
  mutate(log_precip = log(preciptot+1)) %>% #1 is added here to prevent -Inf
  mutate(log_precip_sc = log_precip/max(log_precip)) %>% 
  mutate(log_precip_std = log_precip/mean(log_precip))

#merge
aei <- merge(aei, precip, by = c("ISO", "year"), all.x = TRUE)
rm(precip)
```

And some histograms to check out the scaled/standardized data. 
```{r Precip Hists}
hist(aei$preciptot, breaks = 100)
hist(aei$precip_sc, breaks = 100)
hist(aei$precip_std, breaks = 100)
hist(aei$log_precip, breaks = 100)
hist(aei$log_precip_sc, breaks = 100)
hist(aei$log_precip_std, breaks = 100)
```


## Crop Fraction

Uff. Then comes crop fraction. As mentioned above individual crop fraction per grid cell is multiplied by respective grid cell area, summed on a country level giving ha of each crop grown per country. This was then divided by total area of the country to achieve the % area occupied by a certain crop per country.
```{r CFT manipulation, message=FALSE, warning=FALSE}
#pull cft frac outta here
cft <- 
  cftandprecip %>% 
  select(-c(4)) %>% #removing precip
  mutate_each(funs(.*cellaream2), c(4:35)) %>% #multiply all fractions with the cell area
  group_by(ISO, year) %>% #group in to country and year
  summarise_each(funs(sum)) %>% #take col sums for each crop
  mutate(across(c(3:34), .fns = ~./cellaream2)) #divide col sums by total country area 

#removing categories with 0 (the biofuels and such)
cft <- cft[, -c(18, 19, 26, 34, 35)]

#and a merge
aei <- merge(aei, cft, by = c("ISO", "year"), all.x = TRUE)
rm(cft, cftandprecip)
```

Idk how to show the data here. Any ideas on representation would be helpful here. 

## Topographical

Topographical data is represented here by the data set from @rileyTerrainRuggednessIndex1999 and is conviently available through the `library(rethinking)` from @mcelreathStatisticalRethinkingBayesian2015. It provides a ruggedness factor for each country. 

```{r}
#rugged
data("rugged")
rug <- rugged %>% select(c("isocode", "rugged")) 
colnames(rug) <- c("ISO", "rugged")
#scaling
rug$rugged_sc <- rug$rugged / max(rug$rugged)
#merge
aei <- merge(aei, rug, by = "ISO", all.x = TRUE)
rm(rug, rugged)


```

```{r}
hist(aei$rugged_sc, breaks = 100)
```


***   



# Time Series Evolution of Country Level Target Variable

Ok, so lets peek at how things change over time. Here is a graph of percentage of country area equipped for irrigation from 1960 - 2005.

```{r Irrigated Totals Graph, warning=FALSE}

pirrfrac <-
  ggplot(data = subset(aei, !is.na(irrperc)), 
         mapping = aes(x=year, y= irrperc, group = country, color = six_regions)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrfrac)


```
Although this graph has a lot of countries and can get quite messy, some general trends appear. South Asian countries seem to have high percentages of irrigated area per total land area (irrigation percentage). Then seemingly followed by East Asian and European countries. Africa and the Middle East do not seem to be irrigated as much. Be aware that you can these graphs are interactive, you may zoom, pan or click on the legend to isolate the regions. Although, a regional breakdown follows.



## Sub-Saharan Africa
  
```{r Africa Irrigated Totals Graph, echo=FALSE, warning=FALSE}
pirrssAF <-
  ggplot(data = subset(aei, six_regions == "sub_saharan_africa"), 
         mapping = aes(x=year, y= irrperc, color = country)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrssAF)
```
  
In general, Africa has low levels of irrigated area. Swaziland, Sao Tome and Principe, Reunion and Mauritius lead here with majority of others falling below the levels of these countries.

## Europe and Centrtal Asia
  
```{r Europe Irrigated Totals Graph, echo=FALSE, warning=FALSE}
pirrEU <-
  ggplot(data = subset(aei, six_regions == "europe_central_asia"), 
         mapping = aes(x=year, y= irrperc, color = country)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrEU)
```
  
Europe contains some countries that reach higher percentages of irrigated area. Countries such as the Netherlands, Romania, Albania, and Azerbaijan have reached peaks in the past (roughly 10-15% irrigated area) but seem to be in a decline towards the end our study period. Other countries such as Italy and Denmark have had more stable rises with less down turn toward. Unexpectedly, Russia has a very low percentage of irrigated area, however at this point, the assumption is that either Russia's enormous size cancels out its irrigated area, or that Russia receives enough precipitation for the crops it grows that irrigation is not necessary. Further analysis should reveal this.  

## Americas

```{r Americas Irrigated Totals Graph, echo=FALSE, warning=FALSE}
pirrAM <-
  ggplot(data = subset(aei, six_regions == "america"), 
         mapping = aes(x=year, y= irrperc, color = country)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrAM)
```
  
In the Americas, the % of area irrigated by countries is lower than Asia or Europe. Peaks here are Cuba and the Dominican Republic at around roughly 6-8% of their total countries being irrigated. There are some islands (Barbados and St. Lucia) that have sharp increases in irrigation fraction, but this could be because they are islands and any irrigation at all would induce the ration of irrigated land to total land spike. Counties that have experienced a steady increase, but are not the leaders in overall irrigation fraction are countries that are big and produce a lot of produce (USA, Mexico, Chile) The general trends of irrigation increase seem to be upward with perhaps a slight leveling off toward the end of the study period (for some countries) but lacking the distinct downturn that is seen in Europe.   
  
## East Asia and Oceania

```{r East Asia and Oceania Irrigated Totals Graph, echo=FALSE, warning=FALSE}
pirrEAsia <- 
  ggplot(data = subset(aei, six_regions == "east_asia_pacific"), 
         mapping = aes(x=year, y= irrperc, color = country)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrEAsia)
```
  
The East Asian countries are quite high in % of irrigated area, when compared to the Americas or Africa. The four countries that reach upwards of (roughly) 10% or more of their total area irrigated toward the end of the study period are Vietnam, North Korea, South Korea, and Thailand. Other countries that have high irrfracs are China, Japan and the Philippines. It is expected that for this region the irrigation fraction would be high due to the amount of rice produced. 

Oceania, due to its large amount of islands, does not have a lot of irrigation, majority of countries have 0% irrigation. New Zealand (around 2%) and Australia (less than 1%) have some irrigation. Although this could have a multitude of reasons why irrigation is so low here, regardless of the fact that Australia and New Zealand are very populous countries with mouths to feed. Australia is very large and perhaps its enormity cancels out the irrigated land it does have. 
  
## South Asia
```{r South Asia Irrigated Totals Graph, echo=FALSE, warning=FALSE}
pirrSAsia <-
  ggplot(data = subset(aei, six_regions == "south_asia"), 
         mapping = aes(x=year, y= irrperc, color = country)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrSAsia)
```
Very few countries here, countries in this area are large. Very high levels of irrigation here coming from most countries. Bangladesh, Pakistan, and India are heavily irrigated compared to their land area. 


## Middle East and North Africa
```{r Middle East and North Africa Irrigated Totals Graph, echo=FALSE, warning=FALSE}
pirrME <-
  ggplot(data = subset(aei, six_regions == "middle_east_north_africa"), 
         mapping = aes(x=year, y= irrperc, color = country)) +
  geom_line() +
  scale_x_continuous() +
  theme(panel.grid = element_blank()) 

#run this separately, otherwise the plotly part doesn't work.
ggplotly(pirrME)
```

Irrigation is also prevalent here (but not as much as in south Asia). These countries are dry and therefore would need irrigation to have domestic food supplies. Lebanon, Iraq and Israel lead here. Oil producing countries have little irrigation, they are dry and their GDP is high, perhaps importation is the easiest option for them?

***

Overall, this might be a good overview of what irrperc looks like for each region. 
```{r message=FALSE, warning=TRUE}

theme_set(theme_minimal())

ggplot(subset(aei, !is.na(irrperc)),
       aes(x=six_regions, y=irrperc, fill=six_regions)) + 
  geom_boxplot(show.legend = FALSE)  + 
  coord_flip() +
  theme(legend.title = element_blank())
```

*** 

# References
